Written by Savahnna L. Cunningham

Date: October 13, 2017

The Red Wine dataset is publicly available for research. The details are described in [Cortez et al., 2009].

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:

Introduction

The goal of this analysis is to quantify and gain an understanding of how chemical properties impact the quality rating of red wine. The dataset contains 1599 red wine samples with 11 variables, quantifying the physicochemical properties of each wine. The wine samples in this dataset are related to red variants of the Portuguese “Vinho Verde” wine.

A multiple regression analysis will be conducted on the dataset to test how changes in the 11 independent physicochemical properties predict changes in the quality rating of a wine. The F-test will be used to determine which predictor variables merit inclusion in the model.

The statistical hypotheses for this analysis are as follows:

H0 (Null Hypothesis): None of the 11 independent physicochemical properties has a linear relationship with the dependent quality rating of a wine, meaning every regression coefficient (βj) is zero, which can be mathematically represented as

H0: β1 = β2 = ... = β11 = 0

HA (Alternate Hypothesis): At least one of the 11 independent physicochemical properties predicts the outcome of the dependent quality rating of a wine, which can be mathematically represented as

HA: βj ≠ 0 for at least one j in 1, ..., 11
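
The overall F-test behind these hypotheses compares an intercept-only model with the full regression model. The following is a minimal sketch on simulated data, not the wine data; the data frame `sim` and its columns `x1`, `x2`, `x3` and `y` are assumptions made up for illustration.

```r
# Overall F-test sketch on simulated data (NOT the wine dataset).
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Only x1 truly affects the response; x2 and x3 are pure noise.
sim$y <- 1 + 0.8 * sim$x1 + rnorm(n)

null_model <- lm(y ~ 1, data = sim)  # H0: all slopes are zero
full_model <- lm(y ~ ., data = sim)  # HA: at least one slope is non-zero

# anova() compares the two nested models with an F-test.
f_test <- anova(null_model, full_model)
print(f_test)

# The same F statistic appears on the last line of summary(full_model).
f_from_summary <- summary(full_model)$fstatistic["value"]
```

A small p-value in the `anova()` table is evidence against H0; it does not say which individual predictors matter.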

Attribute information:

   Input variables (based on physicochemical tests):
   
   1 - fixed acidity (tartaric acid - g / dm^3)
   
   2 - volatile acidity (acetic acid - g / dm^3)
   
   3 - citric acid (g / dm^3)
   
   4 - residual sugar (g / dm^3)
   
   5 - chlorides (sodium chloride - g / dm^3)
   
   6 - free sulfur dioxide (mg / dm^3)
   
   7 - total sulfur dioxide (mg / dm^3)
   
   8 - density (g / cm^3)
   
   9 - pH
   
   10 - sulphates (potassium sulphate - g / dm^3)
   
   11 - alcohol (% by volume)
   
   Output variable (based on sensory data): 
   
   12 - quality (score between 0 and 10)

Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

   Output variable (based on sensory data): 
12 - quality (score between 0 and 10)

Methodology

Univariate Plots & Analysis

Summary table representing the 13 variable names. The X1 column represents the wine ID. The ‘quality’ variable is the dependent variable and is qualitative data based on a perceived like or dislike for the wine sample.

##        X1         fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.43       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##  NA's   :2                                                             
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000  
## 

Normal Distribution Plots

The following variables have a normal or close-to-normal distribution: fixed.acidity, volatile.acidity, density, pH and alcohol. The exception in this group is citric.acid, which has a bimodal distribution.

Abnormal Distributions

The following variables do not have a normal or close-to-normal distribution: residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates. The variables have right-skewed distributions, thus a logarithmic function will be used to transform the data to get a better understanding of their respective distributions [4].

Original Data Distribution

Logarithmic Distribution

As you can see, the logarithmic function helped clean up the distributions. The Free Sulfur Dioxide variable is uniquely distributed with a near bimodal distribution.
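
The effect of the logarithmic transformation on a right-skewed variable can be checked numerically. The following is a minimal sketch on a simulated log-normal variable, not the wine data; the `skewness` helper is a simple moment-based definition written here for illustration, not a function from a package.

```r
set.seed(42)
# Simulated right-skewed variable (log-normal); a stand-in, not wine data.
x <- rlnorm(1000, meanlog = 1, sdlog = 0.8)

# Simple moment-based skewness, defined here to avoid extra packages.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)         # clearly positive: long right tail
skew_log <- skewness(log10(x))  # close to zero: roughly symmetric

print(c(raw = skew_raw, log10 = skew_log))
```

A skewness near zero after the transformation suggests the logarithmic scale is a reasonable choice for that variable.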

Synopsis

The red wine dataset contains 1599 red wine samples described by 11 physicochemical variables that may affect a wine’s perceived quality. The main features of interest are the 11 variables and how they correlate with a wine’s quality.

Five physicochemical variables had abnormal distributions. A logarithmic function was used to better understand these distributions.

The next step in this analysis will be to investigate the relationship quality has with the physicochemical variables. The quality variable is stored as the numeric type ‘int’, which is not conducive to comparing groups of wines. The first step will be to derive a factor from the quality variable and add it to the data frame as a new variable, quality.rating, with three categories: good (>= 7), bad (<= 4), and mediocre (5 and 6).


Bivariate Plots & Analysis

Density appears to have a small positive correlation with acids. Additionally, pH has an inverse relationship with the acids, which is to be expected.

The following pairs of independent variables have a strong correlation (|r| > 0.5):

  • free.sulfur.dioxide vs total.sulfur.dioxide
  • fixed acidity vs density
  • fixed acidity vs pH
  • fixed acidity vs citric acid
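
A list like the one above can be extracted from a correlation matrix by filtering on the absolute correlation. The following is a minimal sketch on simulated data; the data frame `toy` and its columns are assumptions for illustration, and for the wine data the same steps would be applied to the 11 independent variables.

```r
set.seed(7)
n <- 300
base <- rnorm(n)
toy <- data.frame(
  x1 = base,
  x2 = base + rnorm(n, sd = 0.5),   # strongly positively related to x1
  x3 = -base + rnorm(n, sd = 0.5),  # strongly negatively related to x1
  x4 = rnorm(n)                     # unrelated noise
)

# Pairwise deletion keeps rows with missing values usable where possible.
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and filter on |r| > 0.5.
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.5, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(strong_pairs)
```

Filtering on the absolute value matters because a strong inverse relationship has a correlation below -0.5, not above 0.5.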

Exploratory Data Analysis

The exploratory data analysis will focus on the relationship between the independent variables and the dependent quality rating variable.

Creating a Categorical Variable

# Gained inspiration for this code from the R-Bloggers website [6, 7].

# Map the integer quality score to three categories:
# good (>= 7), bad (<= 4), mediocre (5 and 6)
wine$quality.rating <- ifelse(wine$quality >= 7, 'good',
                       ifelse(wine$quality <= 4, 'bad', 'mediocre'))

# Store as a factor with the levels ordered from worst to best
wine$quality.rating <- factor(wine$quality.rating, levels = c("bad", "mediocre", "good"))
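
The same three categories can also be produced in one step with base R's cut(). This is an alternative sketch, not the code used in this analysis; the vector `quality` below holds a few example scores rather than the full dataset.

```r
quality <- c(3, 4, 5, 6, 7, 8)  # example scores, not the full dataset

# Right-closed intervals: (-Inf, 4] -> bad, (4, 6] -> mediocre, (6, Inf) -> good
quality.rating <- cut(quality,
                      breaks = c(-Inf, 4, 6, Inf),
                      labels = c("bad", "mediocre", "good"))

# Count how many scores fall into each category.
print(table(quality.rating))
```

cut() returns a factor whose levels already follow the order of the labels, so no separate factor() call is needed.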

Independent Variables vs. Quality Rating: A Positive Linear Relationship

The visualization indicates good wine contains a higher percentage of alcohol, averaging ~12% by volume.

The visualization indicates good wine contains a higher quantity of fixed acidity. As you can see, >8 g/dm³ appears to be the threshold for a wine to be considered good.

The visualization indicates good wine contains a higher quantity of citric acid. As you can see, the quality greatly improves if a wine has a citric acid content between 0.25 and 0.5 g/dm³.

The visualization indicates only a small separation in sulphate concentration between a good and a bad wine. It appears that a good wine will contain a sulphate concentration of ~0.75 g/dm³.


Independent Variables vs. Quality Rating: A Negative Linear Relationship

The visualization indicates that the more volatile acidity a wine contains, the worse its quality rating. A wine is considered good if it contains <0.4 g/dm³ of acetic acid.

The visualization indicates that all wine samples are close to the density of water; however, the wine samples with a good quality rating have a slightly lower density, approximately 0.995 g/cm³.

The visualization indicates that the mediocre and good wine samples are very similar in pH of <= 3.4, while the wine samples with a bad quality rating have a value of >= 3.4.


pH and the Non-Volatile Acid Variables

The visualization represents the inverse relationship between pH and the weak acids found in wine. Substances with a pH below 7.0 are termed acidic and solutions with a pH above 7.0 are termed basic. As you can see, the red wine samples as a whole are considered an acidic solution. As pH goes up, the less acidic the wine becomes.


Density and the Non-Volatile Acid Variables

The visualization represents the effect acidity has on wine density. The fixed acids dissolved in wine are denser than the water and alcohol that surround them. Therefore, as the acid concentration increases, the density of the wine also increases.

Synopsis

The bivariate analysis depicts notable relationships between wine quality and the physicochemical characteristics. As you can see from the boxplots above, fixed acidity and citric acid levels are positively correlated with wine quality: the higher the non-volatile acid level, the better the wine quality. Additionally, because acetic acid produces a vinegar taste, a negative correlation can be found between the volatile acidity variable and wine quality.

A good wine has the lowest density. Density rises with total acid concentration, yet good wines are also higher in fixed and citric acid, so acidity alone does not explain the lower density; the higher alcohol content of good wines is a likely contributor, since density also depends on the percent alcohol. It is also interesting to point out that there seems to be a fine line between total acidity level and pH value. For a wine to be considered good, it has to combine a low volatile acidity level with higher citric acid and fixed acid concentrations, while the overall acid level should leave the pH at roughly 3.3.


Multivariate Plots & Analysis

The visualization compares the free Sulfur dioxide and the total Sulfur dioxide variables to the dependent quality rating variable. As you can see, the majority of “good” wine is located in the lower left quadrant and will have a free Sulfur dioxide of <20 mg / dm^3 and a total Sulfur dioxide concentration of <100 mg / dm^3.

The visualization compares the Fixed Acidity variable against pH value, based on the dependent quality rating variable. As you can see, there is an inverse relationship between pH and fixed acidity. This is to be expected, as pH rises when the acidity level decreases.

The visualization compares the Fixed Acidity and Citric Acid variables with the wine quality rating. Results indicate a positive linear relationship between the independent variables. The majority of “good” wine has a Fixed Acidity concentration of less than 9 g / dm^3 and a Citric Acid concentration less than 0.25 g / dm^3.

The visualization compares the Fixed Acidity and the Density variables with the wine quality rating. Results indicate a strong positive linear relationship between the independent variables. There does not appear to be any correlative relationship between the independent and dependent variables.


Mathematical Model

The goal of the multiple linear regression model is to predict wine quality based on the chemical properties of a wine sample.

# Multiple Linear Regression
dataset = read.csv('wineQualityReds.csv')
dataset = dataset[, 2:13]

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$quality, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)


# Note: feature scaling will be taken care of by the lm() function

# Fitting Multiple Linear Regression to the Training set
regressor = lm(formula = quality ~ .,
               data = training_set)
summary(regressor)
## 
## Call:
## lm(formula = quality ~ ., data = training_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.66781 -0.36656 -0.06195  0.45616  1.96562 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.471e+01  2.370e+01   0.621 0.534945    
## fixed.acidity         2.265e-02  2.878e-02   0.787 0.431501    
## volatile.acidity     -9.534e-01  1.347e-01  -7.078 2.41e-12 ***
## citric.acid          -1.259e-01  1.619e-01  -0.778 0.436697    
## residual.sugar        1.043e-02  1.627e-02   0.641 0.521547    
## chlorides            -1.932e+00  4.586e-01  -4.213 2.70e-05 ***
## free.sulfur.dioxide   3.379e-03  2.487e-03   1.359 0.174485    
## total.sulfur.dioxide -3.005e-03  8.114e-04  -3.704 0.000222 ***
## density              -1.067e+01  2.418e+01  -0.441 0.659225    
## pH                   -4.486e-01  2.161e-01  -2.075 0.038143 *  
## sulphates             8.889e-01  1.311e-01   6.778 1.86e-11 ***
## alcohol               2.917e-01  2.975e-02   9.804  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6519 on 1266 degrees of freedom
## Multiple R-squared:  0.3519, Adjusted R-squared:  0.3462 
## F-statistic: 62.48 on 11 and 1266 DF,  p-value: < 2.2e-16
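# Predicting the Test set results
# y_pred is a matrix with columns fit, lwr and upr (95% prediction interval)
y_pred = predict.lm(regressor, newdata = test_set, interval = "prediction", level = 0.95)

# smoothScatter() uses the first two columns of the matrix: fit (x) and lwr (y)
smoothScatter(y_pred, pch = ".", cex = 5,
              col = "black",
              colramp = colorRampPalette(c("white", blues9)),
              xlab = "Fit",
              ylab = "Model Prediction",
              main = "Predicted Future Values")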

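The visualization plots the model's predicted values for the test set against the lower bound of their 95% prediction interval [10]. The points form a tight band because every predicted value lies inside its own interval by construction, so the plot shows the spread of the predictions rather than their accuracy. Accuracy has to be judged by comparing the predictions with the observed quality scores in the test set.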
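
Whether a model predicts well can be checked by comparing its predictions with the observed values in a held-out test set. The following is a minimal sketch on simulated data, not the wine data; the data frame `sim`, the base R split with sample() and the two summary measures (root mean squared error and interval coverage) are assumptions chosen for illustration.

```r
set.seed(123)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 5 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n, sd = 0.6)

# 80/20 split into training and test rows.
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit  <- lm(y ~ ., data = train)
pred <- predict(fit, newdata = test, interval = "prediction", level = 0.95)

# Root mean squared error of the point predictions on the test set.
rmse <- sqrt(mean((test$y - pred[, "fit"])^2))

# Share of OBSERVED test values that fall inside the prediction interval.
coverage <- mean(test$y >= pred[, "lwr"] & test$y <= pred[, "upr"])

print(c(rmse = rmse, coverage = coverage))
```

If the model assumptions hold, the coverage should be close to the nominal 95%, and the RMSE is on the same scale as the response, which makes it easy to interpret for a 0 to 10 quality score.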

# Plot a correlation matrix of the test set variables
# install.packages('corrplot')
library(corrplot)
cor_matrix = cor(test_set[1:12])

par(mar = c(5, 4, 1.5, 2) + 0.1)  # margin padding
corrplot(cor_matrix, method = "circle", tl.cex = 0.6)
title(main = "Regression Model Correlation Matrix", cex.main = 1.3)

Volatile acidity, chlorides, total sulfur dioxide, sulphates and alcohol are highly significant predictors of the dependent variable (p < 0.001), while pH has a weaker but still significant influence on quality (p < 0.05). The model is significant overall and explains about 35% of the variance in quality; now it is time to optimize it with the Backward Elimination method.

Model Optimization

# Building the optimal model using Backward Elimination
regressor = lm(formula = quality ~ fixed.acidity + 
                 volatile.acidity + 
                 citric.acid + 
                 residual.sugar + 
                 chlorides +
                 free.sulfur.dioxide + 
                 total.sulfur.dioxide + 
                 density + 
                 pH +
                 sulphates +
                 alcohol,
               data = dataset)  
summary(regressor)
## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
## fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
## volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
## residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
## chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
## total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## density              -1.788e+01  2.163e+01  -0.827   0.4086    
## pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
## sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
## alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

# Refitting with only the predictors that remain after Backward Elimination
regressor = lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide +
                 pH + sulphates + alcohol,
               data = dataset)
summary(regressor)
## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.60575 -0.35883 -0.04806  0.46079  1.95643 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.2957316  0.3995603  10.751  < 2e-16 ***
## volatile.acidity     -1.0381945  0.1004270 -10.338  < 2e-16 ***
## chlorides            -2.0022839  0.3980757  -5.030 5.46e-07 ***
## total.sulfur.dioxide -0.0023721  0.0005064  -4.684 3.05e-06 ***
## pH                   -0.4351830  0.1160368  -3.750 0.000183 ***
## sulphates             0.8886802  0.1100419   8.076 1.31e-15 ***
## alcohol               0.2906738  0.0168108  17.291  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6487 on 1592 degrees of freedom
## Multiple R-squared:  0.3572, Adjusted R-squared:  0.3548 
## F-statistic: 147.4 on 6 and 1592 DF,  p-value: < 2.2e-16
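
The Backward Elimination steps between the full model and the final model are not shown above. One common way to automate them is to repeatedly drop the predictor with the largest p-value until every remaining predictor is significant. The following is a minimal sketch under that assumption; the function `backward_eliminate`, the 0.05 threshold and the simulated data frame `sim` are hypothetical, and the sketch only handles numeric predictors.

```r
# Drop the least significant predictor until all p-values are <= alpha.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values of the slopes (row 1 is the intercept)
    p_values <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    worst <- which.max(p_values)
    # Stop when every remaining predictor is significant,
    # or when only one predictor is left.
    if (p_values[worst] <= alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, names(p_values)[worst])
  }
}

set.seed(2017)
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 2 + 1.0 * sim$x1 - 0.7 * sim$x2 + rnorm(n)  # x3, x4 are pure noise

final_fit <- backward_eliminate(sim, response = "y")
kept <- names(coef(final_fit))[-1]
print(kept)
```

For the wine data the call would look like backward_eliminate(dataset, response = "quality"). Because the procedure reuses the same data for selection and testing, the p-values of the final model tend to be optimistic.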

Synopsis

The multivariate analysis reveals strong statistical relationships between wine quality and six of the independent physicochemical properties. The scatterplot visualizations indicate that “good” wine will have low concentrations of citric acid, tartaric acid, total sulfur dioxide and free sulfur dioxide.

An optimized multiple linear regression model using the Backward Elimination method discovered alcohol, volatile acidity, sulphates, total sulfur dioxide, chlorides and pH have a very strong statistical influence on wine quality.


Results

The univariate analysis revealed six independent physicochemical properties have normal or close-to-normal distributions, while the remaining five properties have right-skewed distributions, requiring a logarithmic function be used to better understand the distributions.

The bivariate analysis revealed a positive linear relationship between the dependent variable and the independent physicochemical properties alcohol, fixed acidity, citric acid and sulphates. A negative linear relationship exists between volatile acidity, density, pH and the quality rating. An inverse relationship exists between pH and the acids: pH levels rise as acid levels decrease. Additionally, there is a positive correlation between density and the acids due to the chemical properties that exist between an acid molecule and the surrounding substance. Strong correlations (|r| > 0.5) were discovered between free sulfur dioxide and total sulfur dioxide, fixed acidity and density, fixed acidity and pH, as well as fixed acidity and citric acid.

A multiple linear regression model containing 11 independent variables was fitted to a training set of 1278 wine samples and used to predict wine quality for a test set of 321 samples. The model is statistically significant, with an F-statistic of 62.48 on 11 and 1266 DF, a p-value <2.2e-16 and a residual standard error of 0.6519 on 1266 degrees of freedom. The 11 variables account for 35.19% of the variance in wine quality (adjusted R-squared 34.62%).

A second multiple linear regression model utilizing the Backward Elimination method was fitted to the full dataset of 1599 wine samples to determine which predictor variables have the strongest statistical relationship with the dependent variable.

Optimized Multiple Linear Regression Summary:

The Backward Elimination method indicates that the physicochemical properties alcohol, volatile acidity, sulphates, total sulfur dioxide, chlorides and pH have a very strong statistical influence on wine quality: each coefficient is significant at p < 0.001, and the model has a residual standard error of 0.6487 on 1592 degrees of freedom. These six physicochemical properties account for 35.72% of the variance of wine quality (adjusted R-squared 35.48%). The high F-statistic of 147.4 and the small p-value of < 2.2e-16 give sufficient statistical evidence that the six independent variables predict the quality rating of wine; therefore, the Null Hypothesis can be rejected.
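
The adjusted R-squared values in the model summaries can be reproduced from the multiple R-squared, the number of samples and the number of predictors. The following is a short sketch of that arithmetic using the rounded values printed in the summaries above; the helper name `adjusted_r2` is an assumption for illustration.

```r
# Adjusted R-squared from R-squared, sample size n and predictor count p.
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Optimized model: R-squared 0.3572, n = 1599 samples, p = 6 predictors
adj_optimized <- adjusted_r2(0.3572, n = 1599, p = 6)

# Training-set model: R-squared 0.3519, n = 1278 samples, p = 11 predictors
adj_training <- adjusted_r2(0.3519, n = 1278, p = 11)

print(round(c(optimized = adj_optimized, training = adj_training), 3))
```

The sample sizes follow from the residual degrees of freedom: 1592 + 7 estimated coefficients = 1599, and 1266 + 12 = 1278. Small differences in the last digit can arise because the inputs are already rounded.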


Final Plots and Summary

Plot One

The visualizations represent the distribution of the dependent variable analyzed in the dataset. The plot on the left is a histogram of the wine samples' raw quality scores. As you can see, most of the wine samples have a score of 5 or 6. The raw quality data was transformed into categorical data points to better analyze the information. Scores of 4 or less were labeled “bad”, scores of 5-6 were labeled “mediocre” and scores of 7 or higher were labeled “good”. The visualization on the right depicts the categorical distribution of the quality score. As you can see, the large majority of wine samples fall into the mediocre category, with ~250 “good” samples in the dataset and “bad” wine being the least common.

Plot Two

A multiple linear regression model was fitted to the dataset using the backward elimination method. The findings indicate six independent variables have a high statistical influence (p < 0.05) on the quality of a wine. A violin plot was used to visualize the descriptive statistics of each influential variable. Notice that in some cases, such as total sulfur dioxide, chlorides, and pH, the distance between a “good” and a “bad” wine is minute. However, the three independent variables with the greatest statistical influence, alcohol, volatile acidity and sulphates, do show a noticeable distance between the quality rating groups. The results indicate that a good wine will have a high alcohol percentage, a low volatile acid concentration and a potassium sulphate concentration of ~0.75 g/dm³.

Plot Three

The visualization was created to compare the two independent variables with the highest statistical influence on wine quality. As you can see, most “good” wine is located in the upper left quadrant of the graph, while “bad” wine is more dispersed but has a majority in the lower right quadrant. In summary, “good” wine will contain a low concentration of Acetic Acid, ideally <0.5 g / dm^3 and an alcohol content >12% by volume.

Summary

The exploratory data analysis revealed the distributions of the 11 independent variables, as well as the interactions the physicochemical properties have with each other. The multivariate analysis focused on the independent variables with strong correlations (|r| > 0.5); the results show that fixed acidity, with three such relationships, is strongly correlated with more independent variables than any other property.

The dependent wine quality variable has an approximately normal distribution, with most samples having a quality score of 5 or 6. The alcohol content and volatile acidity concentration have the strongest statistical influence on the dependent variable, with sulphates, total sulfur dioxide, chlorides and pH also having an influence on the quality of a wine. Interestingly, when the two most influential variables, alcohol content and volatile acidity, are compared with the wine quality rating variable, the results show “good” wine has an alcohol content >12% by volume and an acetic acid concentration <0.5 g / dm^3. Future work on this dataset should include exploring the outliers in this analysis. Why is a wine with a high alcohol percentage and a high acetic acid concentration still considered a “good” wine? Is there a unique combination of physicochemical properties within these samples that leads to these abnormal quality ratings?

The mathematical algorithm chosen for this dataset was a multiple linear regression model. The results of the mathematical model indicate the two most influential variables on a wine’s quality are the alcohol content and volatile acidity concentration, with four others also having an influence. However, the model is not without limitations. The model is built with limited data; for example, the wine quality scores only range from 3 to 8. A greater sample size with a wider range in quality scores would improve the model’s robustness. Additionally, the dependent variable is a qualitative measure based on a wine judge’s opinion. Would the quality scores change if the wine samples were given to a population of amateurs to judge? In summary, a greater sample size with a wider range in quality scores and a quantitative wine rating system would improve the model’s predictive power.


Reflection

This analysis used a multiple linear regression model to account for 35.72% of the variance of wine quality. To improve the predictive power of the mathematical algorithm, additional data with a wider spread of quality scores should be used. Moreover, additional predictive models, such as Support Vector Machine (SVM), Decision Tree Regression or K-Nearest Neighbors (KNN), should be employed to test whether they provide more accurate predictions of a wine’s quality as a function of the independent physicochemical properties.

The multiple linear regression model determined the independent physicochemical properties with the highest statistical influence on wine quality are alcohol percentage, volatile acidity, sulphates, total sulfur dioxide, chlorides and pH. Sulphates are added to wine and act as an antimicrobial and antioxidant; good wines have a potassium sulphate concentration of ~0.75 g / dm^3. Furthermore, it was discovered good wines have low chloride and total sulfur dioxide concentrations and a low pH. The independent variables that have the maximum statistical influence on wine quality are volatile acidity and alcohol percentage. Therefore, good wines will have a high alcohol percentage and a low concentration of volatile acids, which give wines an unpleasant, vinegar taste. This analysis exposes a strong relationship between wine quality and the physicochemical properties of alcohol and volatile acidity, demonstrating the importance of a wine being free of imperfections.


References

1. Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, José Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Volume 47, Issue 4, 2009, Pages 547-553, ISSN 0167-9236. https://doi.org/10.1016/j.dss.2009.05.016 (http://www.sciencedirect.com/science/article/pii/S0167923609001377)

2. Dataset link: http://www3.dsi.uminho.pt/pcortez/dss09.bib

3. http://r4stats.com/examples/graphics-ggplot2/

4. http://datadrivenjournalism.net/resources/when_should_i_use_logarithmic_scales_in_my_charts_and_graphs

5. https://www.r-bloggers.com/multiple-regression-lines-in-ggpairs/

6. https://stat.ethz.ch/R-manual/R-devel/library/base/html/levels.html

7. https://www.r-bloggers.com/from-continuous-to-categorical/

8. http://www.shonscience.com/unit-1-earth-as-a-system2/does-the-shape-size-or-temperature-of-matter-affect-its-density

9. https://machinelearningmastery.com/pre-process-your-dataset-in-r/

10. http://www.stat.columbia.edu/~martin/W2024/R6.pdf

11. http://data.library.virginia.edu/diagnostic-plots/

12. https://www.stat.berkeley.edu/classes/s133/Lr.html